Add Word-guess NemoGym GRPO training #1903
Draft
yyu22 wants to merge 6 commits into NVIDIA-NeMo:main from
Conversation
Signed-off-by: root <yayu@nvidia.com>
… for reasoning models

- Add grpo_wordle_nemotron_nano_v2_9b.yaml config for NemoGym Wordle training
- Fix _replace_prefix_tokens crash when chat templates strip reasoning tokens from prior assistant messages (e.g., Nemotron's <think>...</think> stripping)
- Fix _postprocess_nemo_gym_to_nemo_rl_result contiguity assertion for the same reasoning token stripping issue

Signed-off-by: root <yayu@nvidia.com>
Related: #1812. I'd like to disable forcing token-level on-policy, e.g. for agents with context management, but it feels like we shouldn't just quietly fall back to off-policy; it should be a config option at the very least. Rather than disabling the asserts for _replace_prefix_tokens, I think we should just do this for now: https://docs.nvidia.com/nemo/gym/latest/tutorials/nemo-rl-grpo/single-node-training.html#configure-the-chat-template, until we've tested disabling it thoroughly and added a config option.
Revert workaround changes to nemo_gym.py and vllm_worker_async.py. Instead, add a custom Nemotron chat template that preserves <think> tokens in prior assistant messages (no stripping), which keeps token alignment consistent across turns for _replace_prefix_tokens. Signed-off-by: root <yayu@nvidia.com>
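A minimal sketch of how such a template override can be wired up, assuming transformers' AutoTokenizer; the model id and template file name below are placeholders for illustration, not paths from this PR:

```python
from transformers import AutoTokenizer

# Placeholder model id and template path -- illustrative, not from this PR.
tok = AutoTokenizer.from_pretrained("nvidia/NVIDIA-Nemotron-Nano-9B-v2")
with open("nemotron_keep_think.jinja") as f:
    # Replace the stock template with one that keeps <think>...</think> in
    # prior assistant turns, so re-rendered prefixes stay token-aligned.
    tok.chat_template = f.read()
```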
Dependencies
Bug Fix: Token alignment with reasoning-stripping chat templates
Models like Nemotron Nano 9B v2 have chat templates that strip <think>...</think> blocks from prior assistant messages when re-rendering the conversation for later turns. This causes two assertion failures during NemoGym multi-turn tool-calling training:

- a crash in _replace_prefix_tokens, because the re-rendered prefix no longer matches the recorded generation tokens
- a contiguity assertion failure in _postprocess_nemo_gym_to_nemo_rl_result
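A minimal repro sketch of the misalignment, assuming the stock Nemotron template and transformers' apply_chat_template; the model id and conversation content are made up for illustration:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("nvidia/NVIDIA-Nemotron-Nano-9B-v2")  # assumed model id

turn1 = [
    {"role": "user", "content": "Guess the word."},
    {"role": "assistant", "content": "<think>start vowel-heavy</think>My guess: ADIEU"},
]
turn2 = turn1 + [{"role": "user", "content": "A is green, the rest are grey."}]

ids1 = tok.apply_chat_template(turn1, tokenize=True)
ids2 = tok.apply_chat_template(turn2, tokenize=True, add_generation_prompt=True)

# With a template that strips <think>...</think> from prior assistant turns,
# ids1 is no longer a prefix of ids2, so token-level alignment breaks.
print(ids2[: len(ids1)] == ids1)  # False when the template strips reasoning
```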
Fix: fall back to the template's token IDs when alignment fails instead of crashing. Note: this causes token duplication in affected samples, which may slightly impact training quality. The proper fix would be to strip the thinking tokens from generation_token_ids before recording, matching what the template does on re-render.
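In sketch form, the described fallback looks roughly like the following; the function and argument names are illustrative, not the actual NeMo-RL internals:

```python
def splice_generation_tokens(rendered_ids, prefix_ids, generation_ids):
    """Illustrative sketch of the fallback described above (not the real code).

    Normally the recorded on-policy generation tokens are spliced over the
    re-rendered prefix. If the template changed the prefix (e.g. stripped
    <think> spans), fall back to the template's own token IDs rather than
    asserting; this can duplicate thinking tokens in affected samples.
    """
    n = len(prefix_ids)
    if rendered_ids[:n] == prefix_ids:
        # Aligned: keep the exact tokens the model generated (token-level on-policy).
        return prefix_ids + generation_ids
    # Misaligned: trust the re-rendered template tokens instead of crashing.
    return rendered_ids
```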
Training Setup
Generate Wordle data (in Gym repo)
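Hypothetical command sketch: the actual data-generation entry point lives in the NeMo Gym repo, and the module name and flags below are assumptions, not verified CLI.

```bash
# Hypothetical -- entry point and flags are assumptions; check the Gym repo docs.
python -m nemo_gym.examples.wordle.generate_data \
    --num-games 5000 \
    --output data/wordle_train.jsonl
```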
Run training
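A hedged sketch of the launch command, assuming NeMo-RL's uv-based runner; the script path and override are assumptions, and only the YAML name comes from this PR:

```bash
# Script path is an assumption; the config name matches the file added in this PR.
uv run python examples/nemo_gym/run_grpo_nemo_gym.py \
    --config examples/configs/grpo_wordle_nemotron_nano_v2_9b.yaml \
    cluster.gpus_per_node=8
```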